This report explores a tidy data set of 1,599 red wines, each described by 11 variables on its chemical properties. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). The objective is to find out which chemical properties influence the quality of red wines.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The summary shows the minimum, quartiles, median, mean and maximum of each variable. For quality, the median is 6 and the mean is 5.636, so the two are quite close.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The table shows 6 distinct quality scores in this data set, ranging from 3 to 8, and most wines are rated 5 or 6.
## Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
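The frequency table is enough to check the reported median and mean without the original file. The following is a minimal sketch that rebuilds the score vector from the printed counts; the names `quality` and `quality_factor` are illustrative and may differ from those used in the original analysis.

```r
# Rebuild the quality scores from the frequency table printed above.
scores <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)
quality <- rep(scores, times = counts)

length(quality)           # number of wines
median(quality)           # middle score
round(mean(quality), 3)   # average score

# Share of wines rated 5 or 6 (about 82%)
share_5_6 <- mean(quality %in% c(5, 6))
share_5_6

# Treat quality as a categorical variable, e.g. for boxplots by score
quality_factor <- factor(quality)
nlevels(quality_factor)
```

Converting to a factor does not change the data; it only tells plotting and tabulation functions to treat each score as its own category.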
The histograms show that density, pH and quality are approximately normally distributed. The other variables are right-skewed, and several have outliers, mainly the sulphur-related variables, chlorides and residual sugar. Citric acid contains many zero values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.270 7.827 8.720 9.118 10.070 17.050
Fixed acidity, volatile acidity and citric acid all describe acid content, and their distributions differ in shape from that of quality, so I created a new variable, total_acidity, as the sum of the three.
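As a small illustration of this step, the sketch below computes the sum for the first five wines, with values copied from the `str()` output at the top of the report. The data frame name `wine` is an assumption; the original object name is not shown.

```r
# First five wines, copied from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00)
)

# Total acidity as the plain sum of the three acid measurements.
wine$total_acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)

wine$total_acidity
```

The results (8.1, 8.68, 8.6, 12.04, 8.1) agree with the first values of `total_acidity` listed in the later `str()` output.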
## low avg high
## 63 1319 217
Quality takes only discrete values from 3 to 8. The majority of wines were rated 5 or 6, and very few were rated 3, 4 or 8. I therefore grouped quality into a new ordered variable, review: ‘low’ (quality 0 to 4), ‘avg’ (quality 5 or 6) and ‘high’ (quality 7 to 10).
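One way to build such a grouping is with `cut()`. This is a sketch under the assumption that the bins are exactly those described above; the score vector is rebuilt from the frequency table printed earlier, so the counts can be compared with the `low`/`avg`/`high` table.

```r
# Quality scores rebuilt from the frequency table shown earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bins: [0,4] -> low, (4,6] -> avg, (6,10] -> high.
# include.lowest = TRUE keeps a score of exactly 0 in the first bin.
review <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "avg", "high"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

table(review)
```

The resulting counts are 63, 1319 and 217, matching the table above. Using an ordered factor keeps the natural low < avg < high ordering in plots and comparisons.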
The boxplots confirm the histogram results and show how many outliers each variable has. Residual sugar, chlorides and sulphates tend to have the most.
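The quartiles in the earlier `summary()` output give a rough sense of why these three variables stand out. The sketch below assumes the boxplots use the common default of drawing points beyond Q3 + 1.5 × IQR as outliers; the column names in `quartiles` are illustrative.

```r
# Quartiles and maxima copied from the summary() output earlier in this report.
quartiles <- data.frame(
  variable = c("residual.sugar", "chlorides", "sulphates"),
  q1  = c(1.90, 0.07, 0.55),
  q3  = c(2.60, 0.09, 0.73),
  max = c(15.50, 0.611, 2.00)
)

# Default boxplot rule (assumed): points above Q3 + 1.5 * IQR count as outliers.
quartiles$iqr            <- quartiles$q3 - quartiles$q1
quartiles$upper_fence    <- quartiles$q3 + 1.5 * quartiles$iqr
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence

quartiles
```

In all three cases the maximum lies far above the upper fence (for example 15.5 against 3.65 for residual sugar), which is consistent with the long upper tails seen in the plots.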
## 'data.frame': 1599 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ total_acidity : num 8.1 8.68 8.6 12.04 8.1 ...
## $ review : Ord.factor w/ 3 levels "low"<"avg"<"high": 2 2 2 2 2 2 2 3 3 2 ...
The main features of interest are ‘quality’ and ‘review’, since the focus is on how wine quality and its review category relate to the other variables. Quality is roughly normally distributed, with the bulk of the observations in the 5-6 range.
It is difficult to tell from univariate analysis alone, but density, pH and total_acidity may help because their distributions are similar in shape to that of quality.
Yes, I created two new variables, ‘review’ and ‘total_acidity’, which I explained above in the New features section.
The distribution of citric acid is fairly unusual: on a logarithmic scale, fixed acidity and volatile acidity look approximately normal, much like pH, but citric acid does not. Citric acid also has a large number of zero values, which could reflect incomplete or unavailable data.
Plotting citric acid on a logarithmic scale drops 132 zero values, because the logarithm of zero is not finite. The data set was otherwise fairly tidy, so additional auditing and cleaning was not needed. Some outliers remain, but they can be handled in later analysis without difficulty.
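The effect of a log scale on zero readings can be seen on a handful of values. This sketch uses the first ten citric acid values from the `str()` output; the variable names are illustrative.

```r
# First ten citric acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

log_citric <- log10(citric_acid)   # log10(0) evaluates to -Inf
kept <- is.finite(log_citric)      # FALSE for the zero readings

sum(!kept)          # zero readings that cannot appear on a log axis
citric_acid[kept]   # values that remain
```

Five of these ten wines have a citric acid reading of zero, so half of this small sample would vanish from a log-scaled histogram. The same mechanism accounts for the values dropped from the full data set.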
Scatterplots of some of the more interesting variables in the data set.
The bivariate analysis began with a scatterplot matrix. Unfortunately, because of the size of the data set, generating the plot took far too long, so a sample of the data set was used to begin the exploration instead. Even so, the plot was cluttered, and it was hard to draw any conclusions from it.
From exploring these boxplots, it seems that a high quality red wine generally has these properties:
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## total_acidity residual.sugar chlorides
## 0.10375373 0.01373164 -0.12890656
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH sulphates alcohol
## -0.05773139 0.25139708 0.47616632
Quantitatively, it appears that the following variables have relatively higher correlations to wine quality:
All the plots above are consistent with these correlations and show how each variable rises or falls with quality.
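Correlations like those listed above can be computed in one pass. The sketch below is an assumption about the approach, not the original code: it uses `cor.test()` with its default Pearson method (the `cor` label in the printed outputs suggests Pearson) on the first ten wines from the `str()` output. With only ten rows the numbers will not reproduce the reported values.

```r
# First ten wines, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each chemical property with quality.
predictors <- setdiff(names(wine), "quality")
correlations <- sapply(predictors, function(v) {
  unname(cor.test(wine[[v]], wine$quality)$estimate)
})

# Order by absolute strength, strongest first.
correlations[order(abs(correlations), decreasing = TRUE)]
```

Sorting by absolute value puts strong negative relationships, such as volatile acidity in the full data set, alongside strong positive ones.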
Next, plotting the relationships between the variables that correlate most strongly with quality.
These scatterplots show that alcohol, sulphates, citric acid and volatile.acidity are the factors most strongly correlated with quality. Alcohol, citric.acid and sulphates affect it positively and volatile.acidity negatively, so a balance between these factors is needed for the best wine quality.
From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, and volatile acidity and pH are negatively correlated. The correlation tests show similar trends, except that pH has a correlation of only about -0.06, while sulphates have a stronger correlation of 0.25. Quality does not depend much on density.
The logarithmic relationship between acidity and pH was also examined.
## cor
## -0.7044435
This confirms the expected relationship between acidity and pH: pH is a logarithmic measure of acidity, so pH falls as the logarithm of acidity rises.
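The idealised relationship behind this result can be sketched directly from the definition of pH as the negative base-10 logarithm of hydrogen ion concentration. The concentrations below are illustrative round numbers, not values from the wine data.

```r
# pH is the negative base-10 logarithm of hydrogen ion concentration (mol/L).
hydrogen_ion <- c(1e-4, 1e-3)   # a tenfold increase in concentration
ph <- -log10(hydrogen_ion)
ph                              # 4 then 3: pH drops by one unit

# On a log scale the relationship is a straight line with negative slope,
# so the correlation between log10(concentration) and pH is -1 here.
conc <- 10^seq(-4, -3, length.out = 5)
ideal_cor <- cor(log10(conc), -log10(conc))
ideal_cor
```

The measured acid contents in the data set are not hydrogen ion concentrations, so a correlation weaker than the idealised -1, such as the -0.70 reported above, is to be expected.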
Alcohol and volatile.acidity are also weakly negatively correlated:
## cor
## -0.202288
Correlation between citric acid and volatile.acidity:
## cor
## -0.5524957
The strongest relationship was between alcohol and quality, with a correlation of 0.476, which implies that quality improves with alcohol content. Quality declines as volatile acidity increases, with a correlation of -0.391.
Sulphates (0.251), citric acid (0.226), total.sulfur.dioxide (-0.185) and density (-0.175) are also related to quality, though more weakly.
Let’s see how these variables compare when plotted against each other, faceted by review category and coloured by quality score.
This plot shows that alcohol is more strongly positively correlated with quality than sulphates are, but higher levels of both go with better wine quality.
It shows that high-quality wines tend to have high alcohol and low volatile acidity, while average-quality wines have intermediate levels of both.
Both alcohol and citric acid are positively correlated with quality.
Sulphates show little correlation with volatile acidity, and volatile acidity has a more pronounced effect on wine quality, in the negative direction, than sulphates do.
Citric acid and sulphates together are associated with higher red wine quality, but citric acid has a larger effect than sulphates.
The correlation between citric acid and volatile acidity is -0.55; citric acid affects quality positively and volatile acidity negatively.
The two features I found to affect wine quality most are alcohol and volatile acidity, so let’s plot them against each other for the extreme review categories only, i.e. ‘low’ and ‘high’.
The plot clearly shows that the review is high when alcohol content is high and low when volatile acidity is high.
Alcohol and sulphates together also have a strong positive effect on the review, so let’s visualize how they relate to the quality extremes.
The plot above shows that high levels of both alcohol and sulphates go with a high review, and vice versa.
The plots above show that volatile acidity is negatively correlated with quality and with the other positive factors. Volatile acidity not only lowers the quality of the wine but is also negatively related to citric acid, alcohol, sulphates and other positive features.
For the multivariate plots, the features with the strongest relationship to quality were examined by colouring the plots by quality score and faceting them by the three review categories. The result is that higher alcohol, sulphates, citric acid and fixed acidity, together with lower volatile acidity, lead to better wine quality. That is the extent of the analysis so far.
Since alcohol, specifically ethanol, is a weak acid, I expected it to be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid in the Multivariate plots section clearly shows that they are not correlated.
pH and total_acidity also show little visible effect on quality. The pH range is narrow, roughly 3 to 4, so it has little influence on quality, and because total_acidity is correlated with pH, it does not affect the result much either.
No, I didn’t create any model.
These plots were created to show the effect of acidity on wine quality. Generally, higher acidity (or lower pH) is seen in highly rated wines. As a caveat, the presence of volatile (acetic) acid negatively affected wine quality. Citric acid had a relatively high correlation with wine quality, while fixed (tartaric) acid and total_acidity had a smaller impact.
This is perhaps the most descriptive visualisation. I subsetted the data to remove the ‘average’ quality wines, i.e. any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volatile acidity. The plot shows that high volatile acidity kept wine quality down, and vice versa. A combination of high alcohol content and low volatile acidity produced better wines, with few outliers.
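The subsetting step might look like the sketch below. It assumes a `review` factor built with the bins described earlier and uses only the first ten wines from the `str()` output, which happen to contain no ‘low’ wines; the object names are illustrative.

```r
# First ten wines, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same grouping as described earlier: low (0-4), avg (5-6), high (7-10).
wine$review <- cut(
  wine$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "avg", "high"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Keep only the extremes and drop factor levels that no longer occur.
extremes <- droplevels(subset(wine, review != "avg"))

nrow(extremes)
extremes[, c("alcohol", "volatile.acidity", "review")]
```

Dropping the unused levels keeps empty categories out of plot legends and facets.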
This is the most interesting and important visualisation: it shows that good wines tend to have high levels of both sulphates and alcohol. The dotted lines mark the mean of each axis, and the top-right quadrant contains a large density of ‘high’ wine ratings.
Through this exploratory data analysis, I was able to identify the key factors that are correlated with red wine quality, i.e. alcohol, sulphates and acidity.
I had difficulty plotting the ggpairs scatterplot matrix. It was very complicated, so I simplified it by limiting the number of variables in the plot.
I found that alcohol, citric acid and sulphates are positively correlated with quality, while volatile acidity on its own has a strong negative correlation with quality.
I mainly used scatterplots, boxplots and histograms for exploratory visualization of this data set. The final plots depict the relationship between acidity and a good wine and, most importantly, show that such a wine will likely have high alcohol content, high sulphates and low volatile acidity.
The data set would benefit from more information, such as the oxidation of the wine, which strongly affects its quality: oxidation develops and adds aromatic complexity, so the wines become more flavorful and earthy. In red wines, it softens the tannins and stabilizes color.
For future work, these factors could be used as features in a machine learning model that predicts which quality review a wine with given properties should receive.